A Pre-Processing Method to Deal with Missing Values by Integrating Clustering and Regression Techniques

Authors

  • Vincent S. Tseng
  • Kuo-Ho Wang
  • Chien-I Lee
Abstract

Data pre-processing is a critical task in the knowledge discovery process because it ensures the quality of the data to be analyzed. One widely studied problem in data pre-processing is the handling of missing values, with the aim of recovering their original values. Numerous studies on missing values have shown that different methods are needed for different types of missing data. In this work, we propose a new method to deal with missing values in data sets where cluster properties exist among the data records. By integrating clustering and regression techniques, the proposed method can predict the missing values with higher accuracy. To the best of our knowledge, this is the first work combining regression and clustering analysis to deal with the missing values problem. In empirical evaluation, the proposed method performed better than other methods on different types of data sets.
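
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of combining clustering with regression, not the authors' method. It assumes records with one fully observed attribute `x` and one attribute `y` that may be missing (`None`). Records are grouped with a small 1-D k-means on `x`, a least-squares line is fitted inside each cluster, and each missing `y` is predicted from the line of its own cluster. The function names (`kmeans_1d`, `fit_line`, `impute`), the sample data, and the choice of k-means and simple linear regression are all illustrative assumptions.

```python
from statistics import mean


def kmeans_1d(values, k, iterations=50):
    """Tiny 1-D k-means with deterministic initialisation (illustrative only)."""
    ordered = sorted(values)
    # Spread the initial centres evenly across the sorted values.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign every value to its nearest centre.
        labels = [min(range(k), key=lambda c: abs(v - centres[c])) for v in values]
        new_centres = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centres.append(mean(members) if members else centres[c])
        if new_centres == centres:
            break
        centres = new_centres
    return labels


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        # All x equal: fall back to predicting the cluster mean of y.
        return y_bar, 0.0
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope


def impute(records, k=2):
    """records: list of (x, y) pairs where y may be None. Returns a filled copy."""
    labels = kmeans_1d([x for x, _ in records], k)
    models = {}
    for c in set(labels):
        complete = [(x, y) for (x, y), lab in zip(records, labels)
                    if lab == c and y is not None]
        if complete:
            models[c] = fit_line([x for x, _ in complete],
                                 [y for _, y in complete])
    filled = []
    for (x, y), lab in zip(records, labels):
        if y is None and lab in models:
            intercept, slope = models[lab]
            y = intercept + slope * x  # predict from this cluster's own line
        filled.append((x, y))
    return filled


# Hypothetical data: two groups that follow different linear trends.
records = [(1, 2), (2, 4), (3, None), (4, 8),
           (10, 70), (11, 67), (12, None), (13, 61)]
filled = impute(records, k=2)
```

In this made-up data the first group follows y = 2x and the second follows y = 100 - 3x, so the two gaps are filled with values of approximately 6 and 64. A single regression fitted over all records would blend the two trends, which is the intuition behind fitting one model per cluster.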

Similar resources

A method to solve the problem of missing data, outlier data and noisy data in order to improve the performance of human and information interaction

Abstract Purpose: Errors in data collection and failure to pay attention to data that are noisy in the collection process for any reason cause problems in data-based analysis and, as a result, wrong decision-making. Therefore, solving the problem of missing or noisy data before processing and analysis is of vital importance in analytical systems. The purpose of this paper is to provide a metho...

A hybrid method for imputation of missing values using optimized fuzzy c-means with support vector regression and a genetic algorithm

Missing values in datasets should be extracted from the datasets or should be estimated before they are used for classification, association rules or clustering in the preprocessing stage of data mining. In this study, we utilize a fuzzy c-means clustering hybrid approach that combines support vector regression and a genetic algorithm. In this method, the fuzzy clustering parameters, cluster si...

Missing data imputation in multivariable time series data

Multivariate time series data are found in a variety of fields such as bioinformatics, biology, genetics, astronomy, geography and finance. Many time series datasets contain missing data. Imputing missing data in multivariate time series is a challenging topic and needs to be carefully considered before learning from or predicting time series. Much research has been done on the use of diffe...

Handling missing values in kernel methods with application to microbiology data

We discuss several approaches that make it possible for kernel methods to deal with missing values. The first two are extended kernels able to handle missing values without data preprocessing methods. Another two methods are derived from a sophisticated multiple imputation technique involving logistic regression as local model learner. The performance of these approaches is compared using a binary...

Assessment of Clustering Methods for Predicting Permeability in a Heterogeneous Carbonate Reservoir

Permeability, the ability of rocks to transmit hydrocarbons, is directly determined from core. Due to the high cost associated with coring, many techniques have been suggested to predict permeability from easy-to-obtain and frequently measured reservoir properties such as log-derived porosity. This study was carried out to put clustering methods (dynamic clustering (DC), ascending hierarchical clustering ...

Journal title:
  • Applied Artificial Intelligence

Volume 17

Publication date: 2003